Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix: fix the Round-Robin algorithm when "ReadMode == ReplicaReadMixed". #663

Merged
merged 2 commits into from
Jan 11, 2023

Conversation

LykxSassinator
Copy link
Contributor

Signed-off-by: Lucasliang nkcs_lykx@hotmail.com

Bug report

When I was doing relevant optimization on ReadMode, I found there existed a bug on Round-Robin strategy when ReadMode == ReplicaReadMixed.

Supposing that we have a TiKV cluster with 3 nodes - [node 1, node 2, node 3].
Firstly, the cluster keeps in normal state, the Read flows sent to each nodes are in Uniform Distribution.
And if we made one node abnormal, taking Node 1 as the choice, all flows sent to Node 1 would be redirected to Node 3.

With current Round-Robin strategy

//       [Node 1]                                         [Node 1]
//      /        \                                       /        \
//     /          \          [Node 2] abrnomal ===>     /          \  
//    /            \                                   /            \
// [Node 2] --- [Node 3]                            [Node 2] ---> [Node 3]
//                                                     [x]

Because the following steps:

  • The choice of state.lastIdx is randomly generated ==> Could be any one of [node 1, node 2, node 3].
    if state.lastIdx < 0 {
    if state.tryLeader {
    state.lastIdx = AccessIndex(rand.Intn(len(selector.replicas)))
  • if the previous step set state.lastIdx == [Node 2], it will be filtered by isCandidate.
    for i := 0; i < len(selector.replicas) && !state.option.leaderOnly; i++ {
    idx := AccessIndex((int(state.lastIdx) + i) % len(selector.replicas))
    if state.isCandidate(idx, selector.replicas[idx]) {
  • As the checking always starts from i == 0 and state.lastIdx + i, so the first choice, that is [Node 2], will be filtered. And then loops with i == 1, getting the next choice [Node 3], it's a normal node, then exit.
  • Finally, all flows sent to the abnormal [Node 2] will be redirect to [Node 3], making the flows do not meet Uniform Distribution as expected.

And we made a test, the following CPU metrics of TiKV cluster proved it:

image

What we expect

What we expect is that, if Node 2 is abnormal, all flows originally sent to this node should be uniformly redirected to other nodes.

Solution

I made a minor optimization on the original Round-Robin strategy as the following shows. After we try to filter out abnormal node choice, we should randomly choose one which meets the requirements.

for cnt := 0; cnt < replicaSize && !state.isCandidate(idx, selector.replicas[idx]); cnt++ {
	idx = AccessIndex((int(idx) + rand.Intn(replicaSize)) % replicaSize)
}

And after this supplementary strategy is introduced, we get the expected results:

image

Signed-off-by: Lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: Lucasliang <nkcs_lykx@hotmail.com>
@LykxSassinator
Copy link
Contributor Author

@youjiali1995 PTAL, thx

Copy link
Collaborator

@sticnarf sticnarf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sticnarf sticnarf merged commit c598334 into tikv:master Jan 11, 2023
@LykxSassinator LykxSassinator deleted the fix_round_robin branch January 11, 2023 06:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants